Objective of Assignment 2

To take the normalized expression data that was created in Assignment #1 and rank the genes according to differential expression. Once the gene list is ranked, a thresholded over-representation analysis is performed to highlight dominant themes in the top set of genes.

Introduction

The dataset is derived from genetically engineered Foxd1Cre;Smo(flox/-) house mice and aims to investigate the mRNA content of mutant kidneys to gain a deeper understanding of the molecular mechanisms underlying the developmental defects observed in the kidney. The Hedgehog-GLI pathway, which is responsible for the development and patterning of various tissues and organs, including the kidney, interacts with the transforming growth factor beta (TGF) signaling pathway, another critical regulator of kidney development. Smo, a key component of the Hedgehog-GLI signaling pathway, plays a crucial role in human kidney development and modulates TGFβ signaling in the kidney, and its disruption can lead to abnormal kidney development and function. The ultimate goal of understanding the role of Smo in these signaling pathways is to uncover the molecular mechanisms underlying normal kidney development and diseases, which could ultimately lead to the development of new therapeutic strategies for kidney disorders.

In the previous work on the dataset with accession ID GSE103923, an initial processing of the data has been done, including accessing the dataset quality, filtering the low expression gene. The dataset was downloaded in a normalized data with identified as a mixed of HUGO symbol and he Mouse Genome Informatics (MGI) symbol. Hence, a identifier mapping was performed, and no extra normalization was applied. The result of assignment one is a a clean, normalized dataset stored in the root directory as “smo_exp_normalized.txt”.

Preparation

Tools are needed for a differential expression analysis and a threshold over-representation analysis for our normalized dataset created in Assignment 1:

  • Packages used
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
library(BiocManager)
if (!requireNamespace("GEOmetadb", quietly = TRUE)) 
  BiocManager::install("GEOmetadb")
library(GEOmetadb)
if (!requireNamespace("biomaRt", quietly = TRUE)) 
  BiocManager::install("biomaRt")
library(biomaRt)
if (!requireNamespace("edgeR", quietly = TRUE)) 
  BiocManager::install("edgeR")
library(edgeR)
if (!requireNamespace("circlize", quietly = TRUE))
    install.packages("circlize")
library(circlize)
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
    BiocManager::install("ComplexHeatmap")
library(ComplexHeatmap)
if (!requireNamespace("gprofiler2", quietly = TRUE))
    BiocManager::install("gprofiler2")
library(gprofiler2)
if (!requireNamespace("knitr", quietly = TRUE))
  install.packages("knitr")
library(knitr)
if (!requireNamespace("ggplot2", quietly = TRUE))
  install.packages("ggplot2")
library(ggplot2)
if (!requireNamespace("splines", quietly = TRUE))
  install.packages("splines")
library(splines)
if (!requireNamespace("GEOquery", quietly = TRUE))
  BiocManager::install("GEOquery")
library(GEOquery)
if (!requireNamespace("Biobase", quietly = TRUE))
  BiocManager::install("Biobase")
library(Biobase)
if (!requireNamespace("dplyr", quietly = TRUE))
  install.packages("dplyr")
library(dplyr)
if (!requireNamespace("kableExtra", quietly = TRUE)){
  install.packages("kableExtra")}
## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 3 > 1' in coercion to 'logical(1)'
library(kableExtra)

Prepare dataset from Assignment 1

The result dataset from assignment 1 is stored in our root directory as “smo_exp_final.txt”

# Retreive the Assignment 1 result dataset
normalized_counts <- read.table("smo_exp_normalized.txt")
# Inspect the dataset
kable(normalized_counts[1:5,1:4], format = "html")
MUT1 MUT2 MUT3 MUT4
A2m 8.594266 19.16863 8.961176 10.73999
A4galt 171.885312 128.85579 144.498960 128.87987
Aaas 543.348570 562.27980 541.030990 543.44347
Aacs 340.905869 363.13903 375.249237 379.12163
Aagab 550.987917 514.35822 567.914518 459.67155
kable(normalized_counts[1:5,5:8], format = "html")
WT1 WT2 WT3 WT4
A2m 8.506106 18.60708 11.62576 20.68402
A4galt 127.591595 138.51935 116.25760 131.62556
Aaas 507.814548 516.86324 516.37751 546.24606
Aacs 346.198528 350.43328 351.67924 331.88444
Aagab 495.055389 528.23424 450.49820 448.46708

Differential Gene Expression

Two steps will be performed in this step: * Step 1: Create a design model to be used for calculating differential expression, * Step 2: Perform an analysis of differential expression using the normalized expression data obtained in A1.

Step 1: Create a design matrix

  • This dataset has only one factor, the Smo knockout mutant type.
  • The control will be the wild-type samples.
  • inspect our dataset and refer back to the MDS plot from A1 to show which factors included in my model.
# Calculate distances and plot MDS
dist <- dist(t(log2(cpm(normalized_counts) + 1)))
plotMDS(dist, 
        labels=colnames(normalized_counts), 
        col=ifelse(grepl("WT",colnames(normalized_counts)), "#00FFFF", "#FF00BF"), 
        pch=16, main="Normalized MDS plot of Smo(flox/-) mutants and wild type")
Figure 1. Multidimensional Scaling (MDS) plot showing the clustering of samples based on gene expression data. Each point represents a sample, colored according to the treatment group, cyan = Smo-knockout (MUT), pink = wild type (WT). The plot reveals clear separation of the knockout and wild type samples, indicating significant differences in gene expression between the two groups.

Figure 1. Multidimensional Scaling (MDS) plot showing the clustering of samples based on gene expression data. Each point represents a sample, colored according to the treatment group, cyan = Smo-knockout (MUT), pink = wild type (WT). The plot reveals clear separation of the knockout and wild type samples, indicating significant differences in gene expression between the two groups.

Create design matrix with genotype factor

  • use a linear model.
  • The genotype factor represents the two different types of samples (Smo-knockout and wild type).
# define the groups
samples <- data.frame(
  lapply(colnames(normalized_counts), FUN=function(x) {
    if (grepl("MUT", x)) {
      # If the column name contains "MUT", extract the characters following "MUT"
      "MUT"
    } else if (grepl("WT", x)) {
      # If the column name contains "WT", extract the characters following "WT"
      "WT"
    }
  })
)
# set row and column names
colnames(samples) <- colnames(normalized_counts)
rownames(samples) <- c("genotype")
samples <- data.frame(t(samples))
#inspect
samples[1:8,]
## [1] "MUT" "MUT" "MUT" "MUT" "WT"  "WT"  "WT"  "WT"
# create a design matrix of a linear model
design <- model.matrix(~ samples$genotype)
# inspect
kable(design[2:7,], type ="html")
(Intercept) samples$genotypeWT
2 1 0
3 1 0
4 1 0
5 1 1
6 1 1
7 1 1

check assumption

  • perform diagnostic Mean-Variance plot for our dataset
  • Verify the assumption that our dataset to be negative-binomially distributed to use a linear model
# create a DGEList object
dge <- DGEList(counts = normalized_counts, group = factor(rep(c("MUT", "WT"), each = 4)))

# filter low-expressed genes
keep <- rowSums(cpm(dge) > 1) >= 2
dge <- dge[keep,]

# normalize data
dge <- calcNormFactors(dge)

# estimate dispersion
dge <- estimateDisp(dge)

# estimate tagwise dispersion
tagwise_disp <- estimateTagwiseDisp(dge)

# generate MV plot
plotMeanVar(dge, 
            show.raw.vars = TRUE,
            show.tagwise.vars = TRUE,
            NBline = TRUE,
            show.ave.raw.vars = TRUE,
            show.binned.common.disp.vars = TRUE,
            main = "Mean-Variance Plot of Smo(flox/-) mutants and wild type")
# add legend
legend("topleft", 
       legend=c("Raw Data", "Tagwise Dispersion", "Average Raw Variances", 
                "Binned Common Dispersion", "Negative Binomial Line"), 
       col = c("grey", "lightblue", "maroon", "red", "dodgerblue2"), 
       pch=c(1,1,4,4,NA), lty=c(0,0,0,0,1), lwd=c(1,1,1,1,2), cex=0.6)
Figure 2. Mean-variance plot showing the distribution of data. The dispersion and variance of the data  follows the negative binomial distribution, in which the raw data and the blue negative binomial line alligns.

Figure 2. Mean-variance plot showing the distribution of data. The dispersion and variance of the data follows the negative binomial distribution, in which the raw data and the blue negative binomial line alligns.

Step 2: Differential Expression Analysis

  1. Calculate p-values for each of the genes in your expression set. How many genes were significantly differentially expressed? What thresholds did you use and why?
# create a matrix 
expressionMatrix <- as.matrix(normalized_counts[,1:8])
# Set row and column names
rownames(expressionMatrix) <- rownames(normalized_counts)
colnames(expressionMatrix) <- colnames(normalized_counts)[1:8]
# Create a minimal ExpressionSet object to use with limma
minimalSet <- ExpressionSet(assayData=expressionMatrix)

# Fit a linear model
fit <- lmFit(normalized_counts, design)

# Use eBayes to estimate the posterior distribution of log-fold changes
# and calculate moderated t-statistics for each gene
fit2 <- eBayes(fit, trend=TRUE)

# Get the top differentially expressed genes
top_genes <- topTable(fit2,  
                   coef=ncol(design),
                   adjust.method = "BH",
                   number = nrow(expressionMatrix))

# merge gene ids to topfit table
output_hits <- merge(rownames(normalized_counts),
                     top_genes,
                     by.y=0,by.x=1,
                     all.y=TRUE)
# sort by p-value
output_hits <- output_hits[order(output_hits$P.Value),]

# inspect
kable(output_hits[1:5,1:7],type="html",row.names = FALSE)
x logFC AveExpr t P.Value adj.P.Val B
Smo 1703.9076 1389.3918 49.100727 0.0e+00 0.0000000 -4.445771
Plxna4 -292.1647 433.8474 -13.459112 1.0e-07 0.0005542 -4.453819
Chd3 954.4752 4525.8799 9.104120 3.8e-06 0.0083446 -4.463050
Spock2 -772.6805 2987.0129 -9.100051 3.8e-06 0.0083446 -4.463064
Dock11 467.8391 1359.7519 8.960042 4.4e-06 0.0083446 -4.463567
# Rename the column name of gene
colnames(output_hits)[1] <- "gene"

# calculate the number of gene pass the threshold p-value < 0.05
length(which(output_hits$P.Value < 0.05))
## [1] 1027
  • We can observe that 1027 genes pass the threshold p-value < 0.05.
  • A threshold of p-value < 0.05 is used to determine significant differential expression. This is a commonly used threshold in statistical analysis, as it indicates that there is less than a 5% chance that the observed differences in gene expression are due to random chance.
  1. Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction?
# How many genes pass correction (multipole hypothesis testing) ?
length(which(output_hits$adj.P.Val < 0.05))
## [1] 49
  • The choice of multiple hypothesis testing method will be the Benjamini-Hochberg method to controls the false discovery rate (FDR) given that it the gene expression data of the whole cell and genes are independent or weakly correlated.Benjamini-Hochberg method are appropriate for controlling the FDR in large-scale hypothesis testing.
  • 49 genes passed this correction.
  1. Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest.
  • Volcano Plot is used for visualization. The genes of interest with a high log fold change values are labeled in blue.
# Create a vector of colors for each point in the volcano plot
colors <- ifelse(output_hits$P.Value < 0.05 & output_hits$logFC > 1, "red", 
         ifelse(output_hits$P.Value < 0.05 & output_hits$logFC < -1, "green", "grey"))

# Plot the volcano plot with colored points
plot(output_hits$logFC, -log10(output_hits$P.Value), pch=20, col=colors, 
     xlab="log2(Fold Change)", ylab="-log10(P-value)", 
     main="Volcano Plot of Smo(flox/-) mutants and wild type")

# Add legend for the colors
legend("topright", legend=c("Up-regulated", "Down-regulated", "Non-significant"), 
       col=c("red", "green", "grey"), pch=20, cex=0.8)

# Label genes with over top 10 logFC
# Subset top 10 genes with largest absolute logFC values
top10_genes <- subset(output_hits, 
                      abs(logFC) > sort(abs(logFC), decreasing = TRUE)[10])

# Add text labels to the top 10 logFC points in volcano plot
with(output_hits, {
  text(x = logFC, 
       y = -log10(P.Value), 
       labels = ifelse(gene %in% top10_genes$gene, gene, ""), 
       pos = 4, cex = 0.5, srt=45, col="blue")
})
Figure 3. Volcano plot for differential gene expression analysis between 4 Smo(flox/-) mutants(MUT) and 4 wild type (WT) samples. Each point represents a gene, with red points indicating upregulated genes, green points indicating down-regulated genes, and grey points indicating non-significant genes. The gene of interest, Smo, is highlighted in blue. The plot shows significant differential expression of several genes, with Smo showing a log2 fold change of X and a p-value of Y

Figure 3. Volcano plot for differential gene expression analysis between 4 Smo(flox/-) mutants(MUT) and 4 wild type (WT) samples. Each point represents a gene, with red points indicating upregulated genes, green points indicating down-regulated genes, and grey points indicating non-significant genes. The gene of interest, Smo, is highlighted in blue. The plot shows significant differential expression of several genes, with Smo showing a log2 fold change of X and a p-value of Y

  1. Visualize your top hits using a heatmap. Do you conditions cluster together? Explain why or why not.
  • To find the most significantly differentially expressed, we select the genes that pass a P-value threshold of 0.05, and has absolute value of LFC larger than 2 as the top hits.
# Get top hit genes based on p-value and log fold change
top_hits <- output_hits$gene[output_hits$P.Value < 0.05 & abs(output_hits$logFC) > 2]

# Calculate logCPM values for all genes
hm_matrix <- log2(normalized_counts + 1) # Add 1 to avoid log transformation of zero counts.

# Subset matrix to include only top hit genes
heatmap_matrix_tophits <- t(scale(
    t(hm_matrix[which(rownames(hm_matrix) %in% top_hits),]), 
    center = TRUE, scale = TRUE))

# Choose colors for heatmap
heatmap_color <- colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)),
                            c("#00FFFF", "white", "#FF00BF"))

# Create heatmap with dendrograms for genes and samples
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
                           show_row_dend = TRUE,
                           show_column_dend = TRUE,
                           col = heatmap_color,
                           show_column_names = TRUE,
                           show_row_names = FALSE,
                           show_heatmap_legend = TRUE)

{r} heatmap, out.width = "90%", echo = FALSE, fig.cap="Figure 4. Heatmap representation of gene expression analysis between 4 Smo(flox/-) mutants (MUT) and 4 wild type (WT) samples, showing log2 counts per million (logCPM) values for top hit genes identified by differential expression analysis. Rows represent genes and columns represent samples, with pink indicating higher expression and cyan indicating lower expression. The dendrograms on the top and left show clustering of genes and samples, respectively. The heatmap highlights distinct patterns of gene expression between the Smo(flox/-) mutants and control samples, with several key genes showing significant up- or down-regulation."} # Show the heatmap current_heatmap * The heatmap does has a clear clustering of gene expression values that are either up-regulated genes or the down-regulated genes, which suggests a evidence of differentially expressed genes.

Thresholded over-representation analysis

Run a thresholded gene set enrichment analysis (GSEA) with your significantly up-regulated and down-regulated set of genes using the g:Profiler.

Preparation of up-regulated and down-regulated

# Create a vector of gene symbols for the up-regulated and down-regulated genes
upregulated <- output_hits[output_hits$logFC > 0 & output_hits$P.Value < 0.05, ]
downregulated <- output_hits[output_hits$logFC < 0 & output_hits$P.Value < 0.05, ]

upregulated_genes <- upregulated$gene
downregulated_genes <- downregulated$gene

Perform gene set enrichment analysis

  • Use the gprofiler2 function to perform gene set enrichment analysis
  • Note that we were using the organism = mmusculus instead of hsapiens, given that the gene data is collected from house mice
# analyze for all differentially expressed genes
enrich <- gost(query = output_hits$gene, organism = "mmusculus", 
                           ordered_query = TRUE, 
                     measure_underrepresentation = TRUE, 
                     correction_method = "fdr",
                     sources = c("GO:BP", "REAC", "WP"))
# analyze for up-regulated differentially expressed genes
enrich_upreg <- gost(query = upregulated_genes, organism = "mmusculus", 
                           ordered_query = TRUE, 
                     measure_underrepresentation = TRUE, 
                     correction_method = "fdr",
                     sources = c("GO:BP", "REAC", "WP"))
# analyze for down-regulated differentially expressed genes
enrich_downreg <- gost(downregulated_genes, organism = "mmusculus", 
                           ordered_query = TRUE, 
                     measure_underrepresentation = TRUE, 
                     correction_method = "fdr",
                     sources = c("GO:BP", "REAC", "WP"))
# extract the enriched gene sets with a p-value < 0.05
enriched <- enrich$result[enrich$result$p_value < 0.05, ]
enriched_upreg <- enrich_upreg$result[enrich_upreg$result$p_value < 0.05, ]
enriched_downreg <- enrich_downreg$result[enrich_downreg$result$p_value < 0.05, ]
# inspect top 5
knitr::kable(head(enriched[,3:11], 5), format = "html", 
             table.attr = 'style="width: 30%; border-collapse: separate; border-spacing: 15px;"')
p_value term_size query_size intersection_size precision recall term_id source term_name
0 27334 10 9 0.9000000 0.0003293 GO:0008150 GO:BP biological_process
0 1508 10495 30 0.0028585 0.0198939 GO:0007606 GO:BP sensory perception of chemical stimulus
0 1288 10716 21 0.0019597 0.0163043 GO:0009593 GO:BP detection of chemical stimulus
0 1196 10495 17 0.0016198 0.0142140 GO:0007608 GO:BP sensory perception of smell
0 1338 10674 44 0.0041222 0.0328849 GO:0050906 GO:BP detection of stimulus involved in sensory perception
knitr::kable(head(enriched_upreg[,3:11], 5), format = "html", 
             table.attr = 'style="width: 30%; border-collapse: separate; border-spacing: 15px;"')
p_value term_size query_size intersection_size precision recall term_id source term_name
0.0e+00 27334 8 7 0.8750000 0.0002561 GO:0008150 GO:BP biological_process
0.0e+00 3625 585 13 0.0222222 0.0035862 GO:0006396 GO:BP RNA processing
1.4e-06 1572 455 2 0.0043956 0.0012723 GO:0022613 GO:BP ribonucleoprotein complex biogenesis
1.7e-06 1633 507 4 0.0078895 0.0024495 GO:0006397 GO:BP mRNA processing
5.8e-06 1449 536 4 0.0074627 0.0027605 GO:0000375 GO:BP RNA splicing, via transesterification reactions
knitr::kable(head(enriched_downreg[,3:11], 5), format = "html", 
             table.attr = 'style="width: 30%; border-collapse: separate; border-spacing: 15px;"')
p_value term_size query_size intersection_size precision recall term_id source term_name
0.0000000 27334 155 154 0.9935484 0.0056340 GO:0008150 GO:BP biological_process
0.0004051 2350 395 9 0.0227848 0.0038298 GO:0007186 GO:BP G protein-coupled receptor signaling pathway
0.0000000 8407 3 2 0.6666667 0.0002379 REAC:0000000 REAC REACTOME root term
0.0000000 1619 427 31 0.0725995 0.0191476 REAC:R-MMU-168256 REAC Immune System
0.0000000 1559 411 30 0.0729927 0.0192431 REAC:R-MMU-392499 REAC Metabolism of proteins
# inspect top one from three data sources for all genes
knitr::kable(rbind(enriched[enriched$source == "GO:BP",][1,9:13],
                   enriched[enriched$source == "REAC",][1,9:13],
                   enriched[enriched$source == "WP",][1,9:13]),
             format = "html")
term_id source term_name effective_domain_size source_order
1 GO:0008150 GO:BP biological_process 27334 3212
73 REAC:R-MMU-173623 REAC Classical antibody-mediated complement activation 8407 293
107 WP:WP33 WP Fatty acid omega-oxidation 4529 25
# inspect top one from three data sources for up-regulated genes
knitr::kable(rbind(enriched_upreg[enriched_upreg$source == "GO:BP",][1,9:13],
                   enriched_upreg[enriched_upreg$source == "REAC",][1,9:13],
                   enriched_upreg[enriched_upreg$source == "WP",][1,9:13]),
             format = "html")
term_id source term_name effective_domain_size source_order
1 GO:0008150 GO:BP biological_process 27334 3212
15 REAC:0000000 REAC REACTOME root term 8407 1718
56 WP:000000 WP WIKIPATHWAYS 4529 1
# inspect top one from three data sources for down-regulated genes
knitr::kable(rbind(enriched_downreg[enriched_downreg$source == "GO:BP",][1,9:13],
                   enriched_downreg[enriched_downreg$source == "REAC",][1,9:13],
                   enriched_downreg[enriched_downreg$source == "WP",][1,9:13]),
             format = "html")
term_id source term_name effective_domain_size source_order
1 GO:0008150 GO:BP biological_process 27334 3212
3 REAC:0000000 REAC REACTOME root term 8407 1718
21 WP:000000 WP WIKIPATHWAYS 4529 1

The top term returned in each data source for all genes: * GO:BP: biological_process * REAC: Classical antibody-mediated complement activation * WP: Fatty acid omega-oxidation

The top term returned in each data source for both up and down regulated genes: * GO:BP: biological_process * REAC: REACTOME root term
* WP: WIKIPATHWAYS

Visual Representation of the GSEA results

# Three interative figures were generate by g:profiler::gostplot
# Generate a Manhattan plot to visualize the distribution of the top genesets from each data source.
gostplot(enrich) %>% plotly::layout(title = "Manhattan plot for genes", font = list(size = 10))
# Generate a Manhattan plot to visualize the distribution of the top genesets from each data source.
gostplot(enrich_upreg) %>% plotly::layout(title = "Manhattan plot for Upregulated genes", font = list(size = 10))
# Generate a Manhattan plot to visualize the distribution of the top genesets from each data source.
gostplot(enrich_downreg) %>% plotly::layout(title = "Manhattan plot for Downregulated genes", font = list(size = 10))

Figure 5. Manhattan plot displaying the distribution of the top genesets from each data source for downregulated genes of 4 Smo(flox/-) mutants (MUT) and 4 wild type (WT) samples. A extensive number of REAC genesets is returned compared to GP:BP and WP, and genesets of biological_process, REACTOME root term, and WIKIPATHWAYS are exceeding 16 -log10 correction threshold.

Figure 6. Manhattan plot displaying the distribution of the top genesets from each data source for up-regulated genes of 4 Smo(flox/-) mutants (MUT) and 4 wild type (WT) samples. A extensive number of GP:BP genesets is returned compared to REAC and WP, and the genesets of RNA processing, biological_process, Metabolism, REACTOME root term, and WIKIPATHWAYS are exceeding 16 -log10 correction threshold.

Figure 7. Manhattan plot displaying the distribution of the top genesets from each data source for downregulated genes of 4 Smo(flox/-) mutants (MUT) and 4 wild type (WT) samples. A extensive number of REAC genesets is returned compared to GP:BP and WP, and genesets of biological_process, REACTOME root term, and WIKIPATHWAYS are exceeding 16 -log10 correction threshold.

Thresholded over-representation discussion

  1. Which method did you choose and why?
  • The method used is the g:Profiler for the over-representation analysis since we have use it during the lecture. g:Profiler supports a wide range of annoatation sources, including Gene Ontology (GO), KEGG pathways, Reactome, WikiPathways. This allows for comprehensive analysis and interpretation of genes. g:Profiler is also continuously updated, which ensures that the analysis is based on the most up-to-date and relevant information.
  1. What annotation data did you use and why? What version of the annotation are you using?
  • Three sources of annotations, Gene Ontology Biological Process (GO:BP), Reactome (REAC), and WikiPathways (WP) are used for this analysis. The organism is specified as “mmusculus”, since the analysis is being performed on mouse genes for simulating human physiology process. The version of the annotation database is the most current version available at the time of running the analysis, as the gost function retrieves annotation data from the g:Profiler web service, which is regularly updated.
  1. How many genesets were returned with what thresholds?
nrow(enriched)
## [1] 107
nrow(enriched_upreg)
## [1] 69
nrow(enriched_downreg)
## [1] 31
  • The threshold used is p < 0.05.
  • For all genesets, 107 genesets were return.
  • For the up-regulated genes, 69 genesets were returned.
  • For the down-regulated genes, 31 genesets were returned.
  1. Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?
  • The results performed on the up-regulated set of genes, and the down-regulated set of genes provided a more clear insight into the biological processes that are specifically affected by each group of genes. The analysis of the up-regulated genes and the down-regulated genes separately returned fewer gene sets than the analysis using all differentially expressed genes together, which suggests that the up-regulated and down-regulated genesets capture different biological processes and pathways that are not fully represented when all differentially expressed genes are analyzed together.
  • The number of gene sets of the up-regulated genes is more than the down-regulated genes, which indicates more biological processes or pathways are overrepresented among the up-regulated genes compared to the down-regulated genes.

Interpretation

  1. Do the over-representation results support conclusions or mechanism discussed in the original paper?
  1. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

References